EDAV Project

4/26/2018

Analysis of U.S. Tax Revenue

1. Introduction

a. Why we chose this topic:

The U.S. has been the world's largest economy by nominal GDP for more than ten years. We want to know what drives that performance and how tax revenue varies across states and over time. Analyzing these tax revenue patterns should give us a better understanding of the U.S. economy.

b. Our team members are:

Jinhan Cheng (jc4834), Jing Wu (jw3233), and Zhaoyue Bi (zb2225). We collaborated on researching the topic, collecting the data, and analyzing it.


2. Description of Data

a. How the data was collected and accessed:

i. The data comes from government sources. We found it on the United States Census Bureau website, which publishes a wide range of public population and economic data about the U.S. The dataset we use is the Annual Survey of State Government Tax Collections, which covers tax revenue for the 50 states in 25 classifications from 1951 to 2016.

  1. To analyze the specific situation in NYC, we also use more detailed tax revenue data for New York City from 1980 to 2014, taken from the NYC Open Data website.

  2. To analyze average (per-person) tax revenue, we add a second dataset with the estimated population of each state, also from the United States Census Bureau website.

b. Noteworthy features of the data:

  1. The data comes from the government, so we consider it trustworthy and accurate.

  2. The data covers every state and tax classification from 1951 to 2016, which gives us both a bird's-eye view and detailed information about tax income.

  3. Analyzing the tax revenue data helps us develop a deeper understanding of the American economy and answer our questions. One problem is that the most recent exact population count available to us is from 2010, so the 2016 population can only be estimated from that count.


3. Analysis of Data Quality

a. The quality of the data:

The data has some missing values, but its overall quality is still good. The original data uses “0” to represent a missing value, which can be confused with a real “0”, meaning that the state does not collect that kind of tax at all. We could not resolve this ambiguity, so we treat every “0” as a missing value. Some features have only a few missing values in a few states, which has little impact. For features with many missing values, we can use related features instead, which still give us enough information.
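As a minimal sketch of the rule above (every “0” counts as missing), the share of zero values per feature can be computed as follows. The rows and values are made up for illustration and are not taken from the Census file; `zero_share` is a hypothetical helper name.

```python
# Toy rows: values are invented for illustration, not real Census figures.
rows = [
    {"state": "AA", "T01": 0, "T09": 120, "T99": 0},
    {"state": "BB", "T01": 35, "T09": 0, "T99": 0},
    {"state": "CC", "T01": 50, "T09": 210, "T99": 4},
]


def zero_share(rows, columns):
    """Return the fraction of rows in which each column equals 0.

    A 0 may mean "missing" or "this tax is not collected"; the data
    cannot tell the two apart, so both are counted together here.
    """
    n = len(rows)
    return {col: sum(1 for r in rows if r[col] == 0) / n for col in columns}


shares = zero_share(rows, ["T01", "T09", "T99"])
# Print features from most to least affected.
for col, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{col}: {share:.0%} zero or missing")
```

A feature with a high share is a candidate for replacement by a related aggregate feature.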

b. Missing data:

i. By feature, the missing values are as follows:

  1. More than half of the states have no value for T99 (Taxes, NEC).

  2. About one third of the values for T50 (Death and Gift Taxes), T51 (Documentary and Stock Transfer Taxes), and T53 (Severance Taxes) are missing. However, C130 represents their sum, and C130 has no missing values.

  3. About one third of the values for T27 (Public Utilities License) and T01 (Property Taxes) are missing, which means we may lose some accuracy in these features.

ii. By missing pattern, the 4 patterns that cover the most data are as follows:

  1. Fortunately, the first pattern is complete data with nothing missing.

  2. T53 and T99 are missing, that is, Severance Taxes and Taxes, NEC.

  3. T27, T53, and T99 are missing, that is, Public Utilities License, Severance Taxes, and Taxes, NEC.

  4. T14, T51, and T99 are missing, that is, Pari-mutuels Sales Tax, Documentary and Stock Transfer Taxes, and Taxes, NEC.
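One way to tabulate missing patterns like the ones listed above is to record, for each row, the set of features that are 0 and then count how often each set occurs. This is a sketch with made-up rows, not the procedure or the data actually used for this report; `missing_pattern` is a hypothetical helper.

```python
from collections import Counter

# Toy rows: values are invented for illustration only.
rows = [
    {"T14": 3, "T27": 5, "T51": 2, "T53": 7, "T99": 1},
    {"T14": 3, "T27": 5, "T51": 2, "T53": 0, "T99": 0},
    {"T14": 4, "T27": 0, "T51": 6, "T53": 0, "T99": 0},
    {"T14": 0, "T27": 8, "T51": 0, "T53": 9, "T99": 0},
    {"T14": 2, "T27": 6, "T51": 1, "T53": 0, "T99": 0},
]


def missing_pattern(row):
    """Return the sorted tuple of feature codes whose value is 0 in this row."""
    return tuple(sorted(col for col, value in row.items() if value == 0))


# Count how many rows share each pattern; the empty tuple means "complete".
pattern_counts = Counter(missing_pattern(r) for r in rows)
for pattern, count in pattern_counts.most_common():
    label = ", ".join(pattern) if pattern else "complete"
    print(f"{count} row(s): {label}")
```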

c. To make the data easier to understand, we show its structure:

The structure shows that total taxes consist of 5 parts: property taxes, total sales & gross receipts taxes, total license taxes, total income taxes, and total other taxes. Each part is in turn the sum of other values in the table. These five features are therefore a convenient way to get a brief overview of tax revenue in a state.
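Because each aggregate is described as a sum of its parts, the structure can also serve as a consistency check. The sketch below uses only relationships stated in this report (the five parts of total taxes, C107 = C109 + T09, and total income taxes = T40 + T41). The key "TOTAL" is a hypothetical name, since the report does not give the code for total taxes, and the numbers are made up.

```python
# Parent -> children, limited to relationships stated in this report.
# "TOTAL" is a hypothetical key; the report does not name the total-taxes code.
HIERARCHY = {
    "TOTAL": ["T01", "C107", "C118", "C129", "C130"],
    "C107": ["C109", "T09"],
    "C129": ["T40", "T41"],
}


def inconsistent_parents(record, hierarchy=HIERARCHY, tolerance=0):
    """Return the parents whose value differs from the sum of their children."""
    bad = []
    for parent, children in hierarchy.items():
        if parent not in record or any(c not in record for c in children):
            continue  # relationship cannot be checked for this record
        if abs(record[parent] - sum(record[c] for c in children)) > tolerance:
            bad.append(parent)
    return bad


# Invented numbers for a single state and year, for illustration only.
record = {
    "TOTAL": 1000, "T01": 50, "C107": 400, "C118": 100, "C129": 380, "C130": 70,
    "C109": 150, "T09": 250, "T40": 300, "T41": 80,
}
problems = inconsistent_parents(record)
print(problems or "all stated sums are consistent")
```

A non-empty result would point to a component that is recorded as 0 while its parent still includes it, which is one way to spot a 0 that really means "missing".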


To make each part clear, we show the name of each classification:


4. Main Analysis (Exploratory Data Analysis)

a. First, we compare patterns in different states:

  1. First, we show a graph of the 5 major tax revenue categories in each state. We want to draw some general conclusions by comparison.

From this graph, we can see several patterns across states. VT depends on T01 (Property Taxes) more than other states do. In more than half of the states, tax revenue comes mainly from C107 (Total Sales & Gross Receipts Taxes) and C129 (Total Income Taxes). ND, AK, and WY rely on C130 (Total Other Taxes) much more than other states do. In FL and TX, more than 90% of tax revenue comes from C107 (Total Sales & Gross Receipts Taxes), and the other categories account for only about 10%.
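Statements such as "more than 90% comes from C107" are shares of the five major categories. A minimal sketch of that calculation, with invented states and values rather than the real 2016 figures, might look like this:

```python
CATEGORIES = ["T01", "C107", "C118", "C129", "C130"]


def category_shares(record):
    """Return each major category as a fraction of the five-category total."""
    total = sum(record[c] for c in CATEGORIES)
    if total == 0:
        return {c: 0.0 for c in CATEGORIES}
    return {c: record[c] / total for c in CATEGORIES}


def dominant_category(record):
    """Return (code, share) for the category with the largest share."""
    shares = category_shares(record)
    top = max(shares, key=shares.get)
    return top, shares[top]


# Hypothetical states with made-up values, for illustration only.
states = {
    "State A": {"T01": 10, "C107": 450, "C118": 30, "C129": 0, "C130": 10},
    "State B": {"T01": 5, "C107": 150, "C118": 25, "C129": 300, "C130": 20},
}
for name, record in states.items():
    code, share = dominant_category(record)
    print(f"{name}: {code} accounts for {share:.0%}")
```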

  1. Since C107 (Total Sales & Gross Receipts Taxes) = C109 (Total Select Sales Tax) + T09 (Total General Sales Tax), we split C107 into these two parts and compare them by state:


The picture shows that CA, TX, and FL collect much more sales tax revenue than other states, followed by NY, WA, IL, and OH.

It also shows that for the majority of states, T09 (Total General Sales Tax) is larger than C109 (Total Select Sales Tax).

Meanwhile, NH, OR, DE, and AK have no T09 data, so the picture attributes all of their sales tax to select sales tax, which may not be accurate.

  1. Then, we analyze T11 (Amusement Tax) and T14 (Pari-mutuels Tax). We want to know whether LA is the biggest player.

We find that LA's pari-mutuels tax revenue is in fact not the largest. Instead, PA, NV, and IL are the top 3, and LA ranks fourth.
At the same time, amusement tax revenue is negligible compared with pari-mutuels tax revenue, and NY and PA collect the most.

  1. Then, we look at income taxes, which consist of T40 (Individual Income Tax) and T41 (Corporation Net Income Tax).

From the graph we see that CA and NY have the highest individual income tax revenue, which may reflect higher incomes in these two states, although the graph shows totals that have not been adjusted for population.

  1. Finally, we look at an interesting feature, T50 (Death and Gift Tax), which may indicate where the wealthiest people tend to live and pass their estates on to their descendants.

[Figure: US Tax T50 of 2016]

From the graph we see that NY, NJ, and PA collect the most Death and Gift Tax revenue. However, many states have no data for this item, so the result may be biased.

b. Second, we compare total tax revenue:

  1. We start from the total tax revenue of each state:


We also want to know the tax paid per person, which removes the effect of population.
To do this, we introduce a population estimate in section 6, the interactive part. It is extrapolated from the exact count in 2010, so it may not match the real population of each state. For fast-growing states such as CA, which attract more people than before, the estimate is likely too low, so the calculated average tax may be higher than the real number.
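The direction of this bias follows from the per-person formula itself: dividing the same total by a smaller population gives a larger average. A small sketch with made-up numbers (not real state figures) shows the effect:

```python
def per_capita(total_tax, population):
    """Return tax revenue per person."""
    if population <= 0:
        raise ValueError("population must be positive")
    return total_tax / population


# Invented numbers for one hypothetical fast-growing state.
total_tax = 1_200_000_000
estimated_population = 1_000_000  # estimate that lags behind real growth
actual_population = 1_200_000     # assumed true population

estimate = per_capita(total_tax, estimated_population)
actual = per_capita(total_tax, actual_population)
print(f"estimated: {estimate:.0f} per person, actual: {actual:.0f} per person")
```

Here the underestimated population inflates the average from 1000 to 1200 per person.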

c. Third, let’s look at NYC tax revenue:

Q: Do drops or increases in tax revenue have any meaning? What could be the reasons for them? From this graph, we can see that total tax revenue dropped starting in 2008 and recovered around 2010. This is consistent with the recession around 2008 and the recovery after 2010. There is another drop in the early 2000s, which corresponds to the recession at that time.
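Drops like these can also be located programmatically by comparing each year with the one before it. The sketch below uses a made-up series indexed by year number, not the NYC data; `declining_years` is a hypothetical helper.

```python
def declining_years(series):
    """Return the years in which revenue is lower than in the previous year."""
    years = sorted(series)
    return [curr for prev, curr in zip(years, years[1:]) if series[curr] < series[prev]]


# Invented revenue series keyed by year index, for illustration only.
revenue = {1: 100, 2: 108, 3: 112, 4: 101, 5: 104, 6: 110}
drops = declining_years(revenue)
print(drops)
```

The flagged years can then be compared against known recession dates.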

Q: Is there any consistent pattern from 1980 to 2013 across the categories of NYC tax revenue? Yes. The graph shows that revenue in every category increased from 1980 to 2013, as expected. The first reason is inflation. The second is that the economy has been growing: America is able to produce more and spend more, which increases tax revenue.
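To separate these two reasons, nominal revenue can be deflated by a price index so that what remains is real growth. This report does not do that; the sketch below only illustrates the idea, and both the revenue series and the price index are made-up assumptions.

```python
def to_real(nominal, price_index, base_year):
    """Express nominal values in base-year prices using a price index."""
    base = price_index[base_year]
    return {year: value * base / price_index[year] for year, value in nominal.items()}


# Invented series keyed by year index; not real revenue or price data.
nominal = {1: 100.0, 2: 110.0, 3: 121.0}
price_index = {1: 100.0, 2: 105.0, 3: 110.0}

real = to_real(nominal, price_index, base_year=1)
for year in sorted(real):
    print(f"year {year}: nominal {nominal[year]:.1f}, real {real[year]:.1f}")
```

In this toy series nominal revenue grows 21% over two years, but only 10% of that is real growth.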

Q: Is tax collection one of the primary sources of the NYC government's income? Yes, it is. The graph above shows that more than half of NYC's revenue in a given year comes from taxes. Because taxes are one of the government's primary sources of income, it is hard for the government to reduce them. For further analysis, we could pull data on how much tax is collected from each class and see which class contributes the most to tax revenue.

Q: Which category contributes the most tax revenue and the most non-tax revenue from 1980 to 2013? We pulled the NYC Tax Revenues and NYC Non-Tax Revenues datasets for 1980 to 2013. For tax revenue, property tax contributes the most over this period. For non-tax revenue, water and sewer contributes the most. The pattern is fairly consistent from 1980 to 2013 for both tax and non-tax revenue.


5. Executive Summary (Presentation-style)

a. First, we give a bird's-eye view of tax revenue in America:

From the perspective of geography:

From this graph, we can see several patterns across states. VT depends on T01 (Property Taxes) more than other states do. In more than half of the states, tax revenue comes mainly from C107 (Total Sales & Gross Receipts Taxes) and C129 (Total Income Taxes). ND, AK, and WY rely on C130 (Total Other Taxes) much more than other states do. In FL and TX, more than 90% of tax revenue comes from C107 (Total Sales & Gross Receipts Taxes), and the other categories account for only about 10%.

b. Second, we look at one example in detail, the tax revenue of NYC:

Q: Are there any trends in tax revenue from 1980 to 2013? From this graph, we can see that total tax revenue dropped starting in 2008 and recovered around 2010. This is consistent with the recession around 2008 and the recovery after 2010. There is another drop in the early 2000s, which corresponds to the recession at that time.

c. Third, we want to analyze patterns in different states:

d. Finally, we want to give some insights about the tax revenue in America:

VT depends on T01 (Property Taxes) more than other states do. In more than half of the states, tax revenue comes mainly from C107 (Total Sales & Gross Receipts Taxes) and C129 (Total Income Taxes).

6. Interactive Component

To give an overall understanding of tax revenue in the U.S., we use two interactive components to help you get insights.

  1. A geographical tax revenue picture of the 50 states.
  2. A time series tax revenue picture from 1951 to 2016.

a. An interactive Network for tax codes:

b. A geographical tax revenue picture in 50 states:

First, we introduce the population dataset.

As we have seen in the earlier plots, CA always collects more tax than most other states. This plot shows that CA also has the largest population, so it is reasonable to bring population into our analysis of tax collection.

Then, we compare tax revenue by geographical location in 2016, which shows which states contribute the most to tax revenue.

For more information about a state, hover the mouse over it to see its most important tax revenue figures: C107 (Total Sales & Gross Receipts Taxes), C118 (Total License Taxes), C129 (Total Income Taxes), and C130 (Total Other Taxes). This gives a more detailed breakdown of the state's tax revenue.

c. A time series tax revenue picture from 1980 to 2014 in NYC:

You can choose which revenue lines to display. Hovering the mouse over a line shows the date and the amount of tax collected.


7. Conclusion

a. Limitations:

i. The data still has some missing values, which may affect the accuracy of the results.

  1. The data covers 1951 to 2016. Although this gives us enough data, the economy changed a great deal over such a long period, and we have not explored all the details in the data.

  2. The data contains only tax revenue totals, and we have not taken other factors into consideration, so our results may reflect only total revenue and fail to capture those factors.

  3. Each state has its own tax policy and its own tax rates, which affects the comparison between states.

  4. Some states do not levy taxes in certain classifications. We treat these values as missing data, although they are not actually missing.

b. Future directions:

  1. We want to learn more about the differences between the states' tax policies, so that we can introduce methods to remove their effect and compare states more fairly.

  2. We want to pick special periods, such as the downturns around 1998 and 2008, and see how taxes changed.

  3. We want to explore years other than 2016 in more detail and see how the patterns change over time.

  4. We could choose more models for drawing graphs, which would enrich the report.

  5. We want to collect more related data in specific areas to analyze our questions more deeply.